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Abstract 

Public health researchers often estimate health effects of exposures (e.g., pollution, 
diet, lifestyle) that cannot be directly measured for study subjects. A common strategy 
in environmental epidemiology is to use a first-stage (exposure) model to estimate the 
exposure based on covariates and/or spatio-temporal proximity and to use predictions 
from the exposure model as the covariate of interest in the second-stage (health) model. 
This induces a complex form of measurement error. We propose an analytical framework 
and methodology that is robust to misspecification of the first-stage model and provides 
valid inference for the second-stage model parameter of interest. 

We decompose the measurement error into components analogous to classical and 
Berkson error and characterize properties of the estimator in the second-stage model 
if the first-stage model predictions are plugged in without correction. Specifically, we 
derive conditions for compatibility between the first- and second-stage models that guar- 
antee consistency (and have direct and important real- world design implications), and 
we derive an asymptotic estimate of finite-sample bias when the compatibility conditions 
are satisfied. We propose a methodology that (1) corrects for finite-sample bias and (2) 
correctly estimates standard errors. We demonstrate the utility of our methodology in 
simulations and an example from air pollution epidemiology. 



1 Introduction 



We consider measurement error that results from using predictions from a first-stage statistical 
model as the covariate of interest (the exposure) in a second-stage association study. Regard- 
less of the exposure prediction model, there will be measurement error from the difference 
between predictions and the unmeasured true values. In contrast with standard measurement 
error models and the usual classification into classical and Berkson error, such predictions 
induce a complicated form of measurement error that is heteroscedastic and correlated across 



study subjects (Gryparis et al. 2009 Szpiro et al. 2011b). Our objectives are to characterize 



the effects of this error, to give guidelines for study design to minimize the impact, and to 
provide a correction method that reduces bias and gives valid confidence intervals. 
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There has been extensive research on measurement error (Carroll et al. 2006), but the 
statistical literature has only recently begun to deal with the present problem. For spatial 
exposure contrasts, we have generalized the standard categories by decomposing the mea- 
surement error into a Berkson-like component from smoothing the exposure surface and a 



classical-like component from variability in estimating exposure model parameters (Gryparis 



et al.||2009 Szpiro et al.||2011b Sheppard et al.||2011 ). We and others have also shown that the 



parametric bootstrap (or a computationally efficient approximation to the parametric boot 
strap) can be used to correct for the effects of measurement error ( Madsen et al.|20"08 Lopiano 



et al .112011 Szpiro et al.||2011b ). However, validity of these results depends crucially on having 



a correctly specified exposure model. A distinguishing feature of our methodology in this pa- 
per is that it is robust to misspecification of the mean and/or variance in the exposure model 
and still provides valid second-stage inference. 

We focus on two-stage analysis, as this is a common and practical approach when exposure 
is not directly measured. An alternative is a unified analysis in which the exposure model is 
a component of a joint model for the exposure and health data (e.g., Sinha et al. (2010) in 
nutritional epidemiology and Gryparis et al. (2009 ) in air pollution epidemiology), but this type 
of joint model has several difficulties. First, it presupposes that one has a correct (or at least 
nearly correct) exposure model; we argue that an exposure model can generally only capture 
a portion of the full exposure and should be treated in this light. Second, outlying second- 
stage data may infiuence estimation of the exposure model in unexpected ways, especially 
when the second-stage model is misspecified (noting at the same time that this feedback is an 
essential aspect of a coherent joint model and leads to increased efficiency). Third, the same 
exposure data are often used with multiple second-stage outcome data, and it is scientifically 
desirable to use the same predicted exposures across studies. Finally, exposure modeling can 
be computationally demanding, involving spatial and spatio-temporal prediction, so pursuing 



a two-stage strategy has practical appeal. For further elaboration of these points, see Bennett 



and Wakefield (2001), Wakefield and Shaddick (2006), Gryparis et al. (2009), Lunn et al. 



( |2009[ ), and [Szpiro et al.| ( |2011b| ). 

We and others have also evaluated standard correction methods such as regression cal- 



ibration, including using personal measurements as validation data (Gryparis et al. 2009 



Spiegelman 


2010; 


Szpiro et al. 


2011b 



Performance in the spatial setting is mixed, most 
likely because the error structure differs substantially from classical measurement error. The 
methodology we describe here directly accounts for the spatial characteristics of measurement 
error and relies on statistical estimates of uncertainty from the exposure model rather than 
validation data. 

A key application, and the one that motivates this work, is studying the health effects of 
chronic exposure to ambient air pollution. Long-term air pollution exposure has been linked 
with increased cardiovascular morbidity and mortality in prominent studies that form part of 



the basis for regulations with broad economic impact (Dockery et al. 1993 Pope et al. 2002 



Peters and Pope 2002). Early air pollution cohort studies focused on mortality (Dockery 



et al. 1993 Pope et al. 2002), while more recent work has shown associations with non- fatal 



cardiovascular events (Miller et al. 2007) and sub-clinical indicators of disease (Kiinzli et al. 
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risk of air pollution to any single individual is thought to be small, but there are important 
public health implications because of the large number of people exposed and the ability of 
governments to mitigate exposure through regulatory action (Pope et al. 2006). 

In air pollution studies, exposure modeling is motivated by the desire to estimate intra- 
urban (i.e., within a metropolitan area) variation in exposure, which is more difficult to quan- 
tify than inter-urban pollution contrasts. There are significant advantages to exploiting intra- 
urban contrasts, as this can increase statistical power to detect health effects, help rule out 
unmeasured confounding by city or region, and improve our ability to differentiate between 
the effects of different pollutants or pollutant components. Pollution data are typically avail- 
able from regulatory and research monitoring networks but not from long-term residential or 
personal monitoring of individuals participating in observational health studies, leading to a 
spatial misalignment problem. Typical exposure prediction models rely on monitoring data 
in a regression with geographically-varying covariates and smoothing by splines or kriging 



(Fanshawe et al. 2008 



Jerrett et al. 2005a Hoek et al. 2008; Su et al. 2009; Yanosky et al. 



2009 Szpiro et al. 2010 



Brauer 2010). Standard practice is to select an exposure model with 



good prediction accuracy, treat the predicted exposures as known, and plug them into a health 



model to estimate the association of interest without accounting for measurement error (Jer- 



rett et al. 


2005b 


Kiinzli et al. 


2005 


Puett et al. 


2009 


Kim et al. 


2009) 





While motivated by air pollution epidemiology, the core ideas in this paper have much 
broader relevance. Indeed, Prentice (2010) (in the 2008 RA Fisher lecture at the Joint Sta- 
tistical Meetings) states that "measurement error in exposure assessment may be a poten- 
tially dominating source of bias in such important prevention research nutrition and 
physical activity epidemiology." It is essential to better understand the implications of mea- 
surement error in a wide variety of applications in which one must first estimate exposure. 
These applications include (1) nutritional epidemiology, (2) physical activity epidemiology, (3) 
a environmental and occupational epidemiology, (4) exposure to disease vectors or infectious 
agents, and (5) two-stage analyses in functional data contexts. Statistical exposure models are 



commonly used in environmental and occupational epidemiology ( 


Dement et al. 


1983 


Preller 


et al. 


1995 


Stram et al. 


1999 


Ryan et al. 


2007 


Slama et al. 


2007 


), with kriging and land 



use regression particularly popular in air pollution research. More generally, proxy data are 
becoming increasingly available and a natural idea in many contexts is to model an exposure of 
interest given publicly available data. Such data could include remote sensing from satellites 
or large networks of inexpensive sensors deployed to measure physical phenomena. 

Section 2 presents our basic framework, a key feature of which is that it avoids the as- 
sumption that the exposure model is correct and instead projects exposure data into a lower 
dimensional space. Section 3 decomposes the resulting measurement error into Berkson-like 
and classical-like components. We derive conditions under which there is no bias from the 
Berkson-like error, although this component of the error still increases variability of second- 
stage effect estimates. These conditions have important real-world design and analysis implica- 
tions. We then derive asymptotic estimates of the bias and variance caused by the classical-like 
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error. Section 4 describes our measurement error correction approach, wherein we correct for 
bias from the classical-hke error using our asymptotic results and estimate the uncertainty 
by bootstrapping. Sections 5 and 6 present simulations and an example application to the 
Multi- Ethnic Study of Atherosclerosis and Air Pollution (MESA Air). 



2 Analytical framework 



2.1 Data-generating mechanism 

We develop an analytic framework for air pollution cohort study data that also more generally 
illustrates how one can use a measurement error paradigm to formalize two-stage analysis with 
a misspecified first-stage model. While previous work has modeled the spatial variation in air 
pollution as a random field (Gryparis et al. 2009 Szpiro et al. 2010 2011b), we regard the 



spatial surface itself as fixed and treat the data locations as stochastic. This avoids the philo- 
sophical difficulties inherent in attributing spatial structure of long-term average air pollution 
to a stochastic spatial process that would be different in a hypothetical repeated experiment. 
Long-term average air pollution concentrations over one or more years are predominately 
determined by fixed but complex climatological, economic, and geographic systems, so it is 
scientifically preferable to regard the unknown surface as deterministic. Thus, we condition 
on the fixed physical world in the time period of the study and consider a repeated sampling 
framework in which observations might have been collected at different locations according 
to a (not necessarily known) study design. In Section [7| we discuss the implications of this 
approach when considering shorter-term air pollution exposures. 

More formally, consider an association study with health outcomes Ui and corresponding 
exposures Xj for subjects i = 1, ... ,n at geographic locations Sj G M^, with additional health 



model covariates z,- 



including an intercept. Define /Xj = E{yi 



and consider the generalized linear model (GLM) (McCuUagh and Nelder 1989) 



(2.1) 



for a monotonic link function ^/'(/i), unknown /3 and /J^, and conditionally independent Our 
target of inference is the health effect parameter /5. 

We focus on the linear model setting with '?/'(/i) the identity and adopt the notation 



Di = Xil3 + Zif3^ + et 



(2.2) 



where conditional on covariates the are independent but not necessarily identically dis- 
tributed, satisfying E{ei) = 0. If the Xi and Zj were observed without error, inference for /3 
would be routine by ordinary least squares (OLS) and sandwich-based standard error esti- 
mates (White 1980). We are interested in the situation where the i/i and Zj are observed for 



all subjects, but instead of the actual subject exposures we observe monitoring data, x*, for 
j = 1, . . . , ra*, at different locations s*. 
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We emphasize that we regard the spatial locations Sj and s* of study subjects and monitors 
as realizations of spatial random variables. The locations are chosen at the time of the study 
design, and it is natural to regard them as stochastic in order to address the statistical question 
of how the estimates of /3 would vary if different locations were selected according to similar 
criteria. Thus, in our development we assume the Sj and s* are distributed in with unknown 
densities g{s) and h{s), respectively, and corresponding distribution functions G{s) and H{s). 
Throughout, we assume the subject locations are chosen independently of the monitoring 
locations. To simplify the exposition, we further assume in Section [3] and |4] that both sets of 
locations are i.i.d. It is straightforward to account for clustering of subject or monitor locations; 



see, for example, the simulation study in Section and the data analysis in Section [6j 
Conditional on the Sj, we assume the Xi satisfy 

Xi = $(si) + rji, 

with i.i.d. mean zero rji. The function $(s) is a deterministic spatial surface that is potentially 
predictable by covariates and spatial smoothing, and the rji represent variability between 
exposures for subjects at the same physical location. We assume an analogous model for 
the monitoring data at locations s*, with the same deterministic spatial field $(s*) and with 
instrument error represented by f]* having variance cr^*. 

Finally, we assume the additional health model covariates Zj satisfy 

Zi = e(s,) + Ci, 

where 0(s) = {9i{s) , . . . , 9p{s)) is a p-dimensional vector-valued function representing the 
spatial component of the additional covariates, and which includes the intercept, and the 
Ci = {(ill ■ ■ ■ Xip) are random p-vectors independent between subjects and independent of 
the r]i. Each component of has mean zero, but the components of are not necessarily 
independent of each other. To illustrate, one additional health model covariate might be 
household income, decomposed into spatial variation representing the socioeconomic status of 
the neighborhood and the residual variation between residences. 

2.2 Exposure estimation 

Standard practice is to derive a spatial estimator of exposure w{s) based on the monitoring 



data and then to use the w{si) in place of the Xi in (2.1) to estimate /3. We consider a hybrid 
land-use regression and regression spline exposure model. Thus, we let R(s) be a known 
function from to W that incorporates q geographic (land-use) covariates and r — q spline 
basis functions. If we knew the least-squares fit of the exposure surface with respect to the 
density of subject locations g{s), 

7 = argmin J ($(s) - R(s)^)^(iG'(s), (2.3) 
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it would be natural to approximate Xi by w(sj) = R(sj)7. Notice that we do not assume 
the spatial basis is sufficiently rich to represent all of the structure in $(s), so we allow for 
misspecification in the sense that $(s) 7^ w{s) for some s G M^, for any choice of 7. 

We do not know 7, so we will estimate it from the monitoring data by 7 and then use the 
estimated exposure w{si) = R(sj)7 in place of Xi. In particular, we derive 7 by OLS 

n* 

7 = argmin^(x*-R(s*)^)^ (2.4) 



=1 



Under standard regularity conditions ( White||1980 ), 7 is asymptotically normal and converges 



a.s. to 7* as n* — )■ 00, where 7* is the solution to (2.3) with H{s) in place of G{s). In 



SectionkU we discuss the implications of distinct reference distributions in (2.3) and (2.4) 



2.3 Notation for health effect estimates 



We will denote by /3„,n* the health effect estimates obtained from the OLS solution to (2.2) 
using w{si) estimated from n* monitoring locations in place of Xi, for study subjects i = 
1, . . . ,n. We will at times also consider the a.s. limits of Pn,n* as n or n* converges to infinity, 
denoted by /3oo n* and ^noo, respectively. 



3 Measurement error 

We decompose the measurement error Ui = Xi — w(sj) into Berkson-like and classical-like com- 



ponents (Szpiro et al. 2011b). We define w*(sj) = R(sj)7* and introduce the decomposition 

: [xi - W* {Si)) + [w* iSi) - w{Si)) 

■ UtBL + Ui^CL- (3.1) 



The Berkson-like component, Ui^BL, is the information lost from smoothing even with unlim- 
ited monitoring data, and the classical-like component, Ui^cL, is variability that arises from 
estimating the parameters of the exposure model based on monitoring data at n* locations. 



3.1 Berkson-like error {ui^Bi) 

The designation of Ui^sL as Berkson-like error refers to the fact that this is part of the true 
exposure surface that our model is unable to predict, even in an idealized situation with 
unlimited monitoring data. As such, it results in predictions that are less variable than truth. 
We isolate the impact of Ui^BL by operating in the n* = 00 limit with w*{si) = R(si)7* and 
analyzing the asymptotic behavior as n — )■ 00. 

If the exposure model is correctly specified, then $(sj) = R(si)7* for every Sj and Ui^BL = Vi 
is pure Berkson error that does not induce bias in a linear health model and induces little 
bias in other models (Carroll et al. 2006). In general, $(sj) 7^ R(sj)7* for some Sj, but we 
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can specify spatial compatibility conditions under which Ui^BL still does not induce bias in 
estimating (3, asymptotically as n — )• oo. 



Condition 1. The probability distribution 0/ R(s) is the same if s is sampled from G{s) or 
His). 



Condition 2. The span o/R(s) includes the elements of@{s), 6k{s), k = 1, 
structured components of the additional health model covariates. 



, p, the spatially 



Note that Condition [T] is satisfied if the probability distributions of subject and monitor 
locations are identical, i.e., g{s) = h{s) for all s. Condition |2] implicitly requires that 0(s) 
be defined at all locations in the supports of g{s) and h{s). If g{s) = h{s) for all s, then this 
requirement is automatically satisfied since 0(s) is defined at all locations where it is possible 
for study subjects to be located. 

The following lemma holds under sufficient regularity of g{s), h{s), $(s), and R(s). We 
include the proof here because it is helpful for understanding the importance of the foregoing 
spatial compatibility conditions. 



Lemma 1. Assuming Conditionslll and 2, f3n,oo converges a.s. to (3 as n ^ 00. 



Proof. It is easy to see that /3„^oo is the OLS solution to (2.2) using w*(sj) = R(sj)7* in place 
of Xi. Condition [1] implies 7* = 7, so we consider the impact of using w{si) = R(sj)7 as the 
exposure. We write 



Vi = R(si)7/3 + Zi(3, + ((<l>(si) - R(s,)7) (3 + r]if3 + Ei 



(3.2) 



where the three terms grouped in parentheses are regarded as unobserved error terms. To 
apply Lemma 1 from White ( |1980 ), it is sufficient that 

E {R(si)7 X ($(si) - R(si)7)} = (3.3) 

and for each k = 1, . . . ,p 

E {Okis,) X ($(s,) - R(s,)7)} = 0, (3.4) 

where the random sampling of Sj is according to the density of subject locations g{s). Orthogo- 
nality of residuals in the least squares optimization for 7 in (2.3) implies (3.3), and Condition|2] 
implies (3.4) since each 6k{s) can be represented as a linear combination of elements of R(s). 
We actually need (3.4) with = 6k{si) + Qk in place of 6k{si) for Lemma 1 of White (1980), 
but this follows from (3.4) since (ki has mean zero and is independent of Sj. □ 

We comment on the necessity of Conditions [T] and [2j The proof of Lemma [T] depends 
on 7* = 7. This will always hold if $(s) is spanned by the R(s), but otherwise we rely on 
Condition |ll If 7* 7, then (Q becomes 



= R{si)Y(3 + Zil3, + (($(si) - R(s,)7*) (3 + v^(3 + ^i) ■ 



(3.5) 
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We cannot expect that R(sj)7* is orthogonal to ($(sj) — R(sj)7*) when Sj is drawn according 
to the probabihty density g{s), since 7* is the least squares fit from (2.3) with H{s) in place 



of G{s). Therefore, treating ($(Sj) — R(sj)7*) as part of the random variation in (3.5) results 
in the equivalent of omitted variable bias when estimating /3. 



Condition [2] is needed to guarantee ( |3.4 ) in the proof of Lemma [T} The difficulty if (3.4) 
does not hold is that ($(sj) — R(sj)')') in (3.2) may be correlated with one or more elements of 
the 0(sj) component of Zj. Intuitively, this can introduce bias because estimation of /3 relies 
on the variation in R(sj)7 that is unrelated to the covariates Zj, i.e., the residual variation 
after projecting onto the span of the elements of Zj, which is equivalent to the span of 0(sj). 
Without (3.4), the residual term ($(si) — R(sj)7) in (3.2) need not be orthogonal to this 



variation. Note that the need to include the covariates from the health model in the exposure 



model is analogous to the inclusion of covariates in standard regression calibration (Carroll 



et al. 2006) 



3.2 Classical-like error {ui^ch) 



Classical- like measurement error, Ui^cL results from the variability of 7 as an estimator of 
7*. As discussed by Szpiro et al. (2011b), it is similar to classical measurement error in the 



sense that it contributes additional variability to exposure estimates that is not related to 
the outcome. Like classical measurement error, Ui^cL introduces bias in estimating /3 and 
affects the standard error, but it is not the same as classical measurement error because it is 
heteroscedastic and shared between subjects. 

We will isolate the impact of ui^ql by operating in the n = 00 limit, corresponding to the 
entire superpopulation of study subjects, and analyzing the asymptotic properties of /3oo,n*- 



The exposure model parameter vector 7 is asymptotically normal (as discussed in Section 2.2 ) 
with dimension fixed at r, and Poo,n* is a deterministic function of 7, so under the conditions 
of Lemma [1] a standard delta-method argument can be used to establish that /3oo,n* is asymp- 
totically normal with mean (5. In particular, this implies that bias from classical-like error is 
asymptotically negligible in the sense that it is of comparable magnitude to the variance. This 
situation contrasts with classical measurement error where there are as many random error 



terms as observations and there is large-sample bias (Carroll et al. 2006). 



Even though the bias term is asymptotically negligible, our simulation studies suggest 
that it can still be important for moderate size n*, so we will derive a bias correction. Since 
only the variability in the exposure estimate that is orthogonal to covariates from the health 
model plays a role in deriving Poo.n*, it is helpful in the following analysis to define R'^(s) with 
elements -R^(s) = Rk{s) — @{s)il)}^, where -j/'a; = argmin^ J (-Rfc(s) — ©(s)ci;) dG{s). Analogous 
to w{s) and w{s), we define t&^(s) = R'^(s)7 and w'^{s) = R^(s)7. 

Note that the expectation of (3oo,n* need not be defined for finite n* because it is a function 
of 7 , and the denominator in the OLS solution for 7 is not bounded away from zero. Therefore, 
we adapt the definition of asymptotic expectation for a sequence of random variables from 
page 135). The basic idea is to identify the highest order term in a power 



Shao (2010 



series expansion that has non-zero expectation as the asymptotic expectation. See a related 
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discussion of concepts of asymptotic bias in 



Lumley 



(2010 



Appendix A. 1.2). 



Definition 1. Let vi,V2, ■ ■ ■ be a sequence of vector-valued random variables and let ai,a2, ■ ■ ■ 
be a sequence of positive numbers such that lim^^oo Q-n = oc. (i) Suppose v is such that 
E\v\ < oo and we can write Vn = Vn + v'^ with E{vn)=0 and Imin^oo dnv'n -^d v. Then 
we denote £'[a„](i^n) = E{v) and call _E[a^](i;„)/a„ an order asymptotic expectation of 
Vn- (a) Suppose V is such that Cov{v) < oo and lim^^oo \/an^n ^- Then we denote 
Cov[a„]{vn) = Cov(i;) and call Cov[a^](i;„)/a„ an order asymptotic covariance of Vn- 

Lemma 2. Assume sufficient regularity of g{s), h{s), $(s), and R(s) and the same spatial 
compatibility conditions as in Lemma\^ If we set 



Eyn']0o 



/3 



/ w'=(s)E[„.] (^^=(8) - w=(s)) dG{s) J Var[„,] (u;'=(s)) dG{s) 



Jw^{sydG{s) 



/w'=(s)2dG(s) 



J W^(Si)w^(S2)C0V[n*] {w''{Si),W%S2))dG{Si)dG{S2] 

{J w^{sydG{s)Y 



+ 



(3.6) 



and 



Var[„.](/3oo,„*) 



I / w^(si)m;^(s2)Cov[„«] (w^(si),w^(s2)) c^G(si)(iG'(s2) 



{J W^isydGis))' 



}, (3.7) 



then -E[„*](/3oo,n* ~ is an asymptotic expectation of j3o 

asymptotic variance of Poo,n* (both of order n*~^) . 



■ fj and Var[„*](/3oo,n* )/'"'* 



The proof is outlined in Appendix |A| where we express /3oo,n* as a function of 7 and do 
a second order Taylor expansion around 7. Definition [l] is required to define the order n*~^ 
asymptotic expectation in the first term of (3.6), which is a linear function of 7 — 7. The 
first order terms in a Taylor expansion of 7 — 7 are of order n*"^/^ and do not converge when 
multiplied by n* . However, they have expectation zero, so they play the role of Vn and do not 



contribute to the asymptotic expectation. See (3.9) and the surrounding discussion. 



The practical import of Lemma pi is that we can use (3.6) to correct for the bias from 



classical-like error. The variance estimate in (3.7) is not directly useful as a standard error 



because it does not include variability from Berkson-like error or from having n < 00 study 
subjects, but it provides insight into the relative magnitudes of bias and variance from classical- 
like error. 

To estimate (3.6), we can estimate w'^{s) by w^{s) = R'^(s)7, noting that R^(s) is approx- 



imated from the observed exposure covariates for the health observations, orthogonalizing 
with respect to the health model covariates, which is the finite sample approximation to the 
construction of R'^(s) in Section 3.2[ To estimate the variances and covariances of w)'^(s), we 



use a robust estimator for Cov[„.](7) (White 1980 Carroll et al. 2006). We use the sandwich 



estimator to avoid the assumption of having a correctly-specified model, as required for the 
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standard model-based estimator. Given these estimators, all the integrals in the first two 



terms of (3.6) can be estimated as averages with respect to the discrete measure with equal 



weight on each health observation, the standard plug-in estimator for G{s) 



Finally, in the third term of (3.6), we need to estimate E[n*]{w^{s)—w'^{s)) = R(s)i?[„.](7 — 
7), and therefore the asymptotic expectation of 7, which is non-standard and is developed 
as follows. Let $* be the vector comprised of the $(s*) and R* the n* x r matrix obtained 
by stacking the R(s*) for j = 1, . . . ,n*. For arbitrary rrij, denote by M the n* x n* diagonal 
matrix with entries mi, . . . , m„*. If we set rrij = 1/n* for j = 1, . . . ,n* and define 



fi:(mi,...,m„.) = (R*'MR*) ^R*'M$*, (3.8) 

then we notice E {jlsl, . . . , s*,) = ^(mi, . . . , m„.). We are interested in the unconditional 
expectation of 7. Heuristically, we assume that the true h{s) is supported on the observed 
monitor locations and gives equal weight to each observation (i.e., we use the plug- in estimator 
for h{s)). In that case, a realization of s^, . . . , s* , can be expressed as a multinomial draw, 
mi,...,m„*, where the rrij are the fraction of times each location in the support of h{s) 
is drawn. We can estimate the expectation of /t(mi, . . . ,mn*) by means of a Taylor series 
expansion of around rrij = ^ for j = 1, . . . ,n*. Using the first and second moments of a 
multinomial distribution, we have 



M (7-7) ~ 2 1 " ] ~ 2(^ 2^ g^ Q ■ (3.9) 

^ ^ ^ / j = l J ^ ' j,k=l;j^k ^ 

It easy to see that the first order terms in the Taylor expansion of «;(■) (not shown) have expec- 
tation zero, so they play the role of Vn in Definition [T] and do not contribute to the asymptotic 
expectation. We give further details on numerical calculation of the above expression in Ap- 
pendix [Bj A more formal derivation that does not begin by assuming a discrete distribution 
could be developed by a von Mises expansion with the empirical process of monitor locations 
( [van der Vaa7t||1998[ Sect ion 20.1). Note that although we do not observe the $(s*), replacing 



them with x* in~(5^ does not introduce bias since x* = ^{s*)+r]*, and the rj* are independent 



of everything else and have mean zero. 

Finally, we can gain additional insight into the bias and variance contributions from 
classical-like error by considering the simplified situation in which the exposure model is 
correctly specified so that w{s) = R(s)7 for all s, the subject and monitor location densities, 
g{s) and h{s), are the same, and there are no additional covariates or intercept in the health 
model. In that case it is easy to show that the asymptotic expectation simplifies to 

1 1 (r-2)(T2 

-i?M(/3oo,n*-/3) = / . y (3.10) 

and the asymptotic variance simplifies to 

2 



n n 



fw{sydG{s)' 
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The r — 2 term in (3.10) illustrates the fact that the bias is away from the null in the case of 
a one-dimensional exposure model and that more typically it is toward the null and becomes 
larger with higher-dimensional exposure models, for a given true exposure surface. This is 
what occurs empirically in our simulations and examples. 
In addition, the ratio of the squared bias to the variance is 

^ ' (3.12) 



jw{sYdG{s) 



which demonstrates that the importance of the bias depends on the dimensionality of the 
exposure model relative to the sample size and the ratio of the noise to the signal in the 
exposure data. 



4 Measurement error correction 

We correct for measurement error by means of an optional asymptotic bias correction based 



on (3.6) followed by a design-based nonparametric bootstrap standard error calculation (in- 
corporating the asymptotic bias c orrec tion in the bootstrap, if appropriate). 

Given a bias estimate h from ( |3.6 ) the bias-corrected /S^c is /3/(l + h). Bias correction is 



optional since the asymptotic results of Section [3] show that the naive health effect estimator 
is consistent, with variance dominating the bias in the limit as the number of exposure obser- 
vations increases. We explore the magnitude of bias and utility of including the asymptotic 
correction via simulation in the next section and comment further on this topic in Section [7} 
We need to estimate the uncertainty in either /3 or /J^c in a way that accounts for all the 
components of the measurement error and the sampling variability in the health model. Note 



that the asymptotic variance (3.7) accounts only for the variance from the classical-like mea- 
surement error. Since we have assumed that the locations of health and exposure data are 
randomly drawn according to the densities g{s) and /i(s), respectively, a simple design-based 
nonparametric bootstrap is a suitable approximation to the data-generating mechanism. To 
obtain each bootstrap dataset, we separately resample with replacement n* exposure measure- 
ments and n health observations. We fit the exposure model to the bootstrapped exposure 
measurements and use the results to predict exposures at the locations of the bootstrapped 
health observations. We then obtain bootstrap health effect estimates (with or without bias 
correction) and estimate the standard error of /3 or /3bc by means of the empirical standard 
deviation of these values. 



In principle, we could avoid the asymptotic calculations in (3.6) by employing a bootstrap 
procedure to estimate bias followed by a second round of bootstrapping for standard error 
estimation. Such a nested bootstrap is computationally demanding. Furthermore, our strategy 



of using the bootstrap after addressing the bias is consistent with the comments of (Efron 



et al. 1993, Chapter 10) who caution that bias correction with the bootstrap is more difficult 



than variance estimation. Along similar lines, (Buonaccorsi 2010, p. 216) notes the need for 



additional assumptions when developing a two-stage bootstrap that includes bias correction. 
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5 Simulations 



5.1 One dimensional exposure surface 

Our first set of simulations is in tlie simplified setting of a one dimensional exposure surface. 
In this setting, we illustrate the bias from Berkson-like error for very large n* when either 
Condition [T] or [2] is violated, and we illustrate the finite n* measurement error correction 
methods from Section |4] when both compatibility conditions are satisfied. We simulate 1,000 
Monte Carlo datasets and use 100 bootstrap samples, where applicable. 

The true health model is linear regression with /3 = 1, with i.i.d. e ~ A^(0, 1) and an 
intercept but no additional health model covariates. We use n = 500 subjects. The true 
exposure surface on (0,10) is a combination of low frequency and high frequency sinusoidal 
components 

<l>(s) = sin(s + 3.5) + sin(4s - 10.5), 

20 

and we set = a^, = 0.5. The density of monitor locations is 

^ _/ 0-142 0<s<f,f<s<10 , . 

'^^^>-\ 0.0142 f <s<'f, ^^-^^ 

and we use an exposure model R(s) comprised of a B-spline basis with 5 to 25 degrees of 



freedom (Hastie et al. 2001). 

To illustrate the bias from the Berkson-like error when either of the compatibility conditions 
is violated, we set n* = 1000 so that the classical-like error is negligible. The results of these 
simulations are shown in Figure [l} In panels (a) and (b), the health model is fit with an 
intercept but no additional health model covariates, so Condition |2] is automatically satisfied. 
In panel (a) the density of subject locations g{s) is the same as h{s), and there is no evidence 
of bias in /3. In panel (b), g{s) is uniform on the interval (0, 10) so that ConditionSis violated. 
There is clear evidence of bias away from the null for 5 and 9 df exposure models. There is no 
evidence of bias with 13 df, which can be attributed to the fact that the exposure model with 
13 df is sufficiently rich to account for almost all of the spatial structure in $(s), meaning that 
the Berkson-like error behaves like pure Berkson error. 

In panels (c) and (d) of Figure [l| g(s) is the same as h(s), but we fit the health model 
including an additional covariate Zi = sin(sj). In panel (c), this covariate is also included 
in the exposure model, and as expected we see no evidence of bias in (3. In panel (d), the 
additional covariate is not included in the exposure model, so Condition |2] is violated. There 
is noticeable bias of /3 toward the null, especially for the 5 and 9 df spline models. 

In Figure [2| we show results from a separate set of simulations with n* = 200 in order to 
illustrate the measurement error correction methods from Section |4| In these simulations, ^'(s) 
is the same as h(s) and the health model is fit without additional covariates, so Conditions [T] 
and [2] are satisfied. The mean out-of-sample ranges from 0.25 for 5 df to 0.35 for 13 df, 
corresponding to the challenging situation of an exposure model with marginal performance 
that can lead to substantial bias in estimating /3. In panel (a), we see that the uncorrected 
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health effect estimates have notable bias, especially for larger df exposure models, and our 
correction successfully removes most of the bias. Panel (b) shows the coverage of nominal 
95% confidence intervals. In the uncorrected analyses, coverage ranges from 45% to 80%, 
depending on the df in the exposure model. Confidence intervals that incorporate either the 
bias correction or bootstrap standard errors improve the coverage. We obtain nearly perfect 
95% coverage when we incorporate the bias correction and bootstrap standard errors. 



5.2 Spatial exposure surface 



Our second set of simulations is based on the MESA Air study design in the Baltimore region 
(Section 6]), using 1,000 simulated datasets and 100 bootstrap samples. We enforce Con- 
ditions [1 and |2] and focus on illustrating the value of the correction methods described in 
Section |4] in a realistic spatial setting. The spatial domain is a 257 x 257 discrete grid scaled 
to be a square 30 units on a side. There are n* = 125 monitor locations, sampled in clusters 
by first choosing 25 locations i.i.d. uniformly on S and then also including the four nearest 
neighbors for each such location. Our bootstrap for these simulations resamples clusters of 
five monitors. A total of n = 600 subject locations are selected uniformly and independently 
from S. 

The predictable part of the exposure surface is 

$(s) = 70 + 7i^i(s) + 72^2(s) + 73i?3(s) + $i(s). 

Each 7j = 4.9, and each Rk{s) is constructed by drawing i.i.d. realizations from A^(0, 1/3) 
at each s G 5*. $i(s) is a fixed realization from a spectral approximation to a Gaussian field 



with Matern covariance (Paciorek 2007) with range 20 and unit differentiability parameter, 
normalized such that the variance of $i(s) on S is 30. Thus the total variance of $(s) on S 
is approximately 54. In the true exposure surface and monitoring data, there is also a nugget 
with variance cr^ = cr^, = 6. We consider two spatial scenarios, corresponding to different 



fixed realizations of $i(s). These surfaces are shown in Figure |3| 

The spatial exposure model has R(s) comprised of -Ra;(s) for k = 1,2,3 and a thin-plate 



spline basis derived by fitting a GAM from the MGCV package in R (Wood 2006) to the 



observed monitoring data with fixed degrees of freedom (df). Thus the spatial basis is actually 
different for each simulated dataset since it depends on the monitor locations, but we keep 
the same basis functions for the bootstrap analysis within each simulation run. We estimate 
the standard error of 7 using a sandwich estimator for clustered data implemented in the R 



package geepack (Hojsgaard et al. 2006). 



The true and fitted health models have an intercept but no additional covariates. We set 
P = 0.1 and consider i.i.d. normally distributed e ~ A^(0,ct^) with cr^ equal to 200 or 10. 
The larger value of = 200 is consistent with what we see in the MESA Air data with 
left ventricular mass index (LVMI) as the outcome, where the air pollution exposure explains 
approximately 0.3% of the variance after adjustment for known risk factors. We also consider 
CTg = 10 such that air pollution exposure explains approximately 5% of the health outcome 
variance in order to see more clearly the potential impact of exposure measurement error. 
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The two spatial surfaces, while generated at random, represent different deterministic 
scenarios in which we could find ourselves (e.g., different metropolitan areas). In scenario 1 
the spatially structured part of the air pollution surface $i(s) can be represented fairly well 
using thin-plate splines with either 5 or 10 df, while the spatial surface in scenario 2 cannot 
be represented well with 5 df but can be reasonably well modeled with 10 df. This is reflected 
in the values in Figure [s} which represent the best thin-plate spline fits to the surfaces, 
assuming essentially unlimited monitoring data is available. The cross-validated and out-of- 
sample values for predicting the full air pollution surface $(s) (including non-spatially 
structured covariates) based on monitoring data reported in Table [l] exhibit a similar pattern. 
For leave-one-out cross-validation, the clusters of five adjacent monitors are treated as a single 
unit. 

We focus our discussion on the scenarios with = 10 because this is where the impact of 
exposure measurement error is most prominent. The measurement error impact is qualitatively 
similar for = 200, but it is less important because the unmodeled variability in the health 
outcome dominates. 

When we fit the exposure model with a 5 df thin-plate spline, there is modest bias toward 
the null of 3% in scenario 1 and more substantial bias of 12% in scenario 2. Our asymptotic 
correction reduces the magnitude of bias in both instances. The bias correction followed by 
bootstrap standard errors consistently gives valid inference, including accurate standard error 
estimates and nominal coverage of 95% confidence intervals. In scenario 1, we also get valid 
inference with bootstrap standard errors and no bias correction. 

When we increase the complexity of the spatial model to 10 df, prediction accuracy im- 
proves in both scenarios, but inference about the health effect parameter is degraded. The 
magnitude of bias is approximately the same as with 5 df, but our asymptotic correction is less 
effective. Furthermore, the bootstrap standard error estimates tend to be too large, resulting 
in over-coverage of 95% confidence intervals. These findings are not surprising, because while 
each simulated dataset has 125 monitor locations, they are clustered in groups of 5 so that 
there are effectively only 25 unique locations for estimating the smooth component of the 
spatial surface, so a thin-plate spline model with 10 df overfits these data in the sense that we 
do not expect to be able to rely on large n* asymptotic approximations such as (3.6) or the 
nonparametric bootstrap. 



6 Data analysis 



The Multi-Ethnic Study of Atherosclerosis and Air Pollution (MESA Air) is an ongoing cohort 
study designed to investigate the relationship between air pollution exposure and progression 
of subclinical atherosclerosis (Bild et al. 2002; Kaufman et al. |2012 ). The MESA Air cohort 
includes over 6,000 subjects in six U.S. metropolitan areas (Baltimore City and Baltimore 
County, MD; Chicago, IL; Forsyth County (Winston-Salem), NC; Los Angeles and Riverside 
Counties, CA; New York and Rockland County, NY; and St Paul, MN). Four ethnic/racial 
groups were targeted, white, African American, Hispanic, and Chinese American, and all study 
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participants (46 to 87 years of age) were without clinical cardiovascular disease at the baseline 
examination (2000-2002). An early cross-sectional finding from MESA Air is that an elevated 
left-ventricular mass index (LVMI) is associated with exposure to traffic related air pollution, 



specifically outdoor residential concentrations of gaseous oxides of nitrogen (NOx) (Van Hee 



et al. 2009 2012[ ). Van Hee et al. (2012) found that an increase in NOx concentration of 10 
parts per billion (ppb) is associated with a 0.36 g/rn? increase in LVMI (95% CI: 0.02 - 0.7 
g/w?). 



Van Hee et al. (2012) utilized predictions from a spatio-temporal exposure model that 



incorporates regulatory and study-specific monitoring data in all six regions (Szpiro et al. 



2010). To illustrate our methodology for a purely spatial exposure model, we re-analyze the 



data restricted to subjects in the Baltimore region, and we construct an exposure model based 
on data from three community snapshot monitoring campaigns conducted by MESA Air. 
In brief, the community snapshot campaign consisted of three separate rounds of spatially 
rich sampling during single two-week periods in different seasons. In the Baltimore area, 
approximately 100 measurements were made in each of three two-week periods in May 2006, 
November 2006, and February 2007. In each round of snapshot monitoring, the majority 
of monitors were arranged in clusters of six, with three on either side of a major road at 
distances of approximately 50, 100, and 300 meters (Cohen et al. 2009). In addition, the 
locations were chosen to characterize different land use categories and to cover the geographic 
region as broadly as possible. To help with satisfying Condition [T| we exclude one cluster from 
our analysis because it is far from any of the study subjects, and we approximate long-term 
average concentrations by averaging the three available measurements at locations that were 
monitored in all three seasons. The 93 monitor locations and 625 subject locations in our 
analysis are shown in Figure |4j 

Our exposure model incorporates five geographic covariates: (i) distance to a major road, 
(ii) local-source traffic pollution from a dispersion model ( Wilton et al.||2010 ), (iii) population 
density in a 1 km buffer, (iv) distance to downtown, and (v) transportation land use in a 1 
km buffer. The first three of these geographic covariates are log-transformed. An additional 
covariate describing the density of high-intensity land-use (commercial, industrial, residential. 



etc.) was also incorporated in the original spatio-temporal model predictions used by (Van Hee 



et al. 2012), but we exclude this covariate from our model because it has very different distri- 



butions across subject and monitor locations, a clear violation of Condition [T} To account for 
unmodeled spatial structure, we use a thin-plate spline basis with 0, 5, or 10 df, constructed 
as in the simulations. We estimate the standard error of 7 using a sandwich estimator for 
clustered data implemented in the R package geepack (Hojsgaard et al. 2006). We estimate 



the association between NOx and LVMI by fitting a multivariate linear regression, including 
an exhaustive set of additional health model covariates that could be potential confounders 
dVan Hee et"aL|[2009| ) . 

The results of our analysis are shown in Table |2| with 10,000 bootstrap replicates (resam- 
pling clusters of monitors, where applicable). Our findings are very similar for an exposure 
model that is purely land-use regression and one that includes splines with 5 df. We esti- 
mate that an increase in NOx concentration of 10 parts per billion (ppb) is associated with 
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approximately a 0.7 g/m?' increase in LVMI. Our standard error estimates for these models 
in Table [2] rang e from 0.55 to 0.68 gjvn? ^ so the difference in effect size from that found by 



Van Hee et al. (2012) is very likely due to our more limited dataset. The exposure model 
that includes 5 df splines has a larger cross- validated i?^, suggesting that it captures more 
variability in the exposure. This translates into a smaller model-based standard error, but 
this apparent advantage is attenuated when we correct for exposure measurement error with 
bootstrap standard error estimates, and it goes away entirely when we also incorporate the 
bias correction. 

The exposure model with 10 df gives slightly larger effect estimates and standard errors. 
There is also more evidence of bias from classical-like error than for the lower dimensional 
exposure models. However, our simulation results in Table [l] suggest that a 10 df spline is too 
rich of a model for the available monitoring data and that these results should be considered 
less reliable than those based on 5 df splines. 



7 Discussion 

We have developed a statistical framework for characterizing and correcting measurement error 
in two-stage analyses, focusing particularly on problems where a first-stage spatial model is 
used to predict exposure that is measured at different locations than are needed in a second- 
stage health analysis. Our methodology is robust to misspecification of the exposure model, 
treating it as a device to explain some portion of the variability in exposure. We adopt a design- 
based perspective in which the process of selecting exposure measurement and subject locations 
is the primary source of spatial randomness, leading naturally to nonparametric bootstrap 
resampling for standard errors. A major contribution of our work is that we delineate the 
potential sources of bias from Berkson-like and classical-like measurement error and provide 
strategies for reducing bias and variance at the design and analysis stages. Bias from classical- 
like error can be corrected using an asymptotic approximation, whereas bias from Berkson-like 
error does not vanish asymptotically and needs to be addressed at the design stage or when 
selecting an exposure model. 

While our research is primarily motivated by epidemiologic analysis of long-term air pollu- 
tion health effects, we note that the spatial prediction problem can be interpreted as a linear 
model. Thus, our measurement error decomposition, asymptotic results, and bias correction 
hold equally well in non-spatial settings. 

Our theory and simulations demonstrate that bias from the classical-like error is small 
when the exposure model is not overfit in the sense that there are sufficient observations 
relative to the dimension of the exposure model for large n* asymptotic to be relevant. The 
limited magnitude of the bias suggests that measurement error correction efforts should focus 
on avoiding overfitting the exposure model and satisfying the conditions needed to ensure that 
Berkson-like error does not induce bias (at least in a linear health model). Nonetheless, in 
several simulation scenarios our asymptotic correction for bias from classical-like error results 
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in improved estimation and inference, even at the expense of increased variance. Indeed, in 
our analyses and simulations the increased variance caused by estimating the bias is modest. 

Our theoretical development motivates the use of a nonparametric bootstrap to account for 
variability induced by measurement error. When the bias correction is not used, simulations 
suggest that the underestimation of uncertainty from ignoring the measurement error (using 
a sandwich variance estimator) is modest, but even so there are cases in which accounting for 
the effect of measurement error is necessary. When we include the asymptotic bias correction, 
the bootstrap is more consistently necessary for valid confidence intervals. 

Several practical approaches can be considered in designing a study to minimize the bias 
from Berkson-like error. We will explore in detail the options and tradeoffs outlined below in 
future work. 

First, to satisfy Condition [T| as much CcirG clS possible should be taken at the design stage 
to ensure the sampling densities of locations and exposure covariates are as similar as possible 
in the first-stage exposure observations and the second-stage outcome observations. While 
this criterion is overly abstract in the context of a specific study, the practical implication is 
that first-stage and second-stage locations should be chosen to be similar in terms of location 
and pertinent covariates. If exposure data have already been collected, it may be necessary 
to consider excluding exposure or outcome data or deleting one or more covariates from R(s) 
in order to minimize the mismatch. 

If we are particularly concerned about Condition |2| we can add terms to R(s) to span 
0(s). We generally will not know 0(s) directly, but if we do (e.g., if household income were 
known and monitors were located at homes) then supplementing R(s) with 0(s) or projecting 
R(s) to make it orthogonal to 0(s) are equivalent. In most realistic settings, we will assume 
that 0(s) is a set of smooth functions of space that can be modeled by spline terms, but we 
will not know the minimal spanning spline basis. In this case it is preferable to supplement 
R(s) with as rich of a basis as possible without introducing substantial classical-like error. 
Projecting R(s) to make it orthogonal to a similarly rich spline basis would likely result in 
a significant diminution of exposure variability beyond what is needed to eliminate bias from 
not satisfying Condition |2} 

The possibility of adding dimensions to R(s) highlights the critical tradeoff between Berkson- 
like and classical-like error. Augmenting R(s) reduces Berkson-like error by accounting for 
more of the variability in w{s). Since eliminating Berkson-like error also eliminates the need 
to satisfy the asymptotic conditions, we generally expect that adding such terms will limit 
bias from the Berkson-like error. A side effect of augmenting R(s) is to change the sampling 
variability of 7, which impacts the classical-like error. This could be beneficial if the additional 
terms in R(s) account for a substantial amount of variability in w{s), since the result will be 
to reduce the variance of the original components of 7. On the other hand, if the coefficients 
for the new terms are difficult to estimate, the result will be a substantial new contribution to 
the classical-like error, leading to additional bias and variance in the second-stage estimation. 
In fact, in order to reduce classical-like error, one might choose to remove selected dimensions 
from R(s) if their coefficients are particularly difficult to estimate. 

There are some key assumptions in our model that may not be strictly satisfied in air 
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pollution epidemiology studies. First, we regard the sets of locations of exposure and health 
observations to be independent, or at least independent clusters. This assumption can be 
questioned, particularly in the case of air pollution monitors, as one would not expect a 
government agency to select two sites that are very close together. Second, a major source 
of exposure heterogeneity that we do not consider is the difference between exposure at a 
residence and the exposure experienced by individuals when they are not home. Mobility may 
be less important in studies of small children and the elderly, but this remains an open issue 
in the epidemiologic literature. 



Finally, as described in Section 2.1, we condition on the unobserved but deterministic 
spatial variation in exposure during the time period of the study. This avoids having to 
postulate that one could meaningfully repeat the experiment in other time periods. This is 
particularly important when the averaging period of interest is one or more years since secular 
trends in the nature and sources of air pollution limit the number of years during which air 
pollution studies can be regarded as answering analogous scientific questions. For shorter-term 
studies, there is additional variability associated with the choice of time period, and it would 
be reasonable to regard the different air pollution surfaces at different times as arising from a 
random spatial process. However, with data from only a single time period and a misspecified 
mean model, it is impossible to identify both the fixed and random components of the spatial 
residuals, so we do not incorporate a random effect in our formulation. 

Our measurement error correction is based on asymptotic approximations derived for linear 
regression for the exposure and health models. Real world applications often involve addi- 
tional complications, suggesting further research directions. On the exposure model side, our 
methods can be extended to penalized models and full-rank models such as universal kriging 
and related spatio-temporal models that are often used in environmental studies. Nonlinear 
models such as logistic regression and Cox regression are commonly used for the second stage 
in health studies, and it is also important to consider the implications of misspecification in 
the second-stage model, in addition to the exposure model. 

Two-stage analyses to date have taken the approach of optimizing the exposure model 
for exposure prediction accuracy, based on the implicit assumption that this will also lead to 
optimal second-stage health effect inference. In previous work we have shown that optimizing 
the exposure model for prediction accuracy can be sub-optimal for health effects estimation 
(Szpiro et al. 2011a). An interesting avenue for future research involves developing methods 
to optimize the exposure model for estimation of the health effect of interest in the second- 
stage model. A final direction for additional research that is of great interest in air pollution 
epidemiology is to extend these methods for measurement error correction when assessing 
health effects of multiple exposures or mixtures of exposures. When predictions for more than 
one exposure are used in a health model, there is the possibility of a form of omitted variable 
bias from components of variability that are missing from the predictions of the exposures. 
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A Proof of of Lemma 



2 



Proof of Lemma^ By definition, fi^^n* is the 'true' parameter value in a linear model for y 
on w{s) and covariates, z, so we can express 

for some (izoan* ^i^d an error term, z/j, that is orthogonal to w{si) and to the elements of Zj 
with respect to the density g{s). We can rewrite this expression as 



Vi = /3c 



(A.i: 



and note that each of the last three terms is orthogonal to w'^{si). Given this orthogonality, 
we can view the last three terms as a single error term. The OLS solution for this no-intercept 
linear model. 



converges a.s. to Poo,n* for any fixed n* based on Lemma 1 of 
each observation as 



White 



(1980). We can express 

Vi = Pw'is,) + P{w{si) - w^isi)) + /3($(si) - w{si)) + (3r]i + /3jz, + q (A.S) 



and plug in for yi in (A. 2). Taking the limit as n — )■ oo shows that the a.s. limit of (A. 2) can 
be expressed as 



/3o 



/3 



(A.4) 



^w^isfdGis) ' 

where the limiting value on the right hand side is found by dividing the numerator and 
denominator of (A. 2) by n and invoking the law of large numbers. In the numerator, only 
the first summand in the expression for yi gives a non-zero contribution. The contribution 
from the second summand is zero because (w(sj) — w'^{si)) is a linear combination of elements 
of 0(s) while w'^{s) is orthogonal to each component of 0(s). For the third summand, we 
use our assumption that Condition [2] is satisfied. Namely, w^{s) is a linear combination of 
elements of 0(s) and R(s), so it is sufficient for ($(sj) — w (sj)) to be orthogonal to each of 
the elements of 0(s) and R(s), and this is guaranteed if the elements of 0(s) are in the span 
of R(s) as required by Condition |2} 

We now define the p x p matrix A = J R'^(s)^R'^(s)(iG'(s) and set 

/(7) = (7^A7) (7^ At)"', 
so that (3oo,n* = Pf{.l)- The gradient of /(7) is 

D/(7) = -2(7TA7)(7TA7)-2A7 + A^Y^A-f, 
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and its Hessian is 



- 2(7 ' A7)(7 ' A7)-^A + 8(7 ' A7)(7 ' A7)"''A77 ' A 



2(7^A7)"^A77'^A 



2(7' A7)"2A77' A. 



A second order Taylor expansion of /(7) can be written 



/(7) 



1 - {j'^Aj)-\j - 7)TA7 - (7TA7)-1(7 - 7)TA(7 - 7) 
+ 2(7^A7)"2(7 - 7)^A77^A(7 - 7) 

J (w'^is) -w''{s))w%s)dG{s) J {w%s) - w^is))^ dG{s) 
Jw^{sydG{s) /w7^(s)2rfG(s) 

(/(u;^(s) - w^(s))M;^(s)ciG(s))' 

/ (^^=(8) - w%s)) w%s)dG{s) J {w%s) - w%s) f dG{s) 



+ 2- 



+ 2 



/w^(s)2dG'(s) /u;^(s)2dG'(s) 

/ [W'js^) - W%Si)) {W'{S2) - W'{S2)) W%Si)w'{s2)dG{Si)dG{s2) 

{J w^is)dGis))' 



We require sufficient regularity such that the higher order terms converge in distribution to 
zero sufficiently fast as n* — > 00. Equations (3.6) and (3.7) are then derived by taking the 



asymptotic expectation and asymptotic variance of the expression above (for the asymptotic 
variance, we can restrict our attention to the ffist non-constant term above). This requires 
interchanging asymptotic expectations with respect to 7 with integrals in s. A sufficient 
condition to do this is that R(s) is bounded as a function of s. The interchange is accomplished 
by expressing asymptotic expectations as standard expectations of the appropriate limiting 
distributions of functions of v^(7 — 7) using Definition [l| invoking the continuous mapping 
theorem to express these limiting distributions as fu nctions of the limiting distribution of 
-\/n*(7— 7), and then using the Fubini-Tonelli theorem ( Folland|l999 ) to exchange the order of 
integration between 7 and of s. There is an additional step to express asymptotic expectations 
as the asymptotic covariances in (3.6) and (3.7). We use Definition [l] to express the asymptotic 
expectations as standard expectations of the appropriate limiting distributions of functions of 
A/n*(7 — 7), again invoke the continuous mapping theorem, express the resulting expectations 
as covariances of the limiting distributions of functions of ^A^*(7 — 7), and finally write the 
resulting expression as asymptotic covariances using Definition [Tj □ 



B Estimating the asymptotic bias 



To estimate the asymptotic bias based on (3.8), we require the second derivatives 
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Differentiating with respect to nij and gives us a long expression involving ffist and second 



21 



derivatives of M: 



(R*^MR*) ^R*^^R*(R*^MR*) ^R*^^R*(R*^MR*) ^R*^M** - 

^ ^ orrik orrij ^ ' 

Q2A/r 

(R* MR*)"^R* — R*(R* MR*)"^R* M** + 

^ ' omjOmk 

(R*^MR*)"^R*^^R*(R*^MR*)"^R*^^R*(R*^MR*)"^R*^M$* - 
^ ^ orrij ^ ' orrik 

(r*^mr*)~'r*^-^r*(r*^mr*)"'r*^-^$* - 

^ ^ orrij ^ ' orrik 

(R*^MR*) ~'r*^ ^R* (R*^MR*) ~'r*^ + 

^ ^ orrik orrij 

^ ' OmjOmk 

The first derivative with respect to mj is a matrix of zeros with a single one in the jth diagonal 
position, while the second partial derivative is a matrix of zeros. The terms that remain in the 
expression can be easily calculated based on the observed R* and using the plug-in estimate, 
X* ^ {xl,...,x*^.y, for Note that R*^^R* = R*R*^, where R* is the jth row of R*, 

andR*^|^X* = Rfx*. 
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Figure 1: Density plots of (3 show bias from Ui^BL in the one dimensional simulation scenario 
with n* = 1000 when the compatibility conditions are not satisfied (1,000 Monte Carlo simu- 
lations for each scenario), (a) Conditions 1 and 2 are satisfied: subject and monitor locations 
have the same density, and the health model is fit without subject specific covariates. (b) 
Condition 1 is violated: same as (a) except that g{s) is uniform on (0, 10). (c) Conditions 1 
and 2 are satisfied: subject and monitor locations have the same density, the health model is 
fit with a sinusoidal covariate, and the additional covariate is included in the exposure model, 
(d) Condition 2 is violated: same as (c) except that the sinusoidal covariate is not included in 
the exposure model. 




Figure 2: The two-step measurement error correction method adjusts for bias and gives vahd 
standard error estimates in the one dimensional simulation scenario with n* = 200 (1,000 
Monte Carlo simulations for each scenario and 100 bootstrap samples), (a) Uncorrected health 
effect estimates have noticeable bias, especially for high dimensional exposure models. Our 
analytical correction successfully adjust for the bias, (b) The combination of bias adjustment 
and bootstrap standard errors gives 95% confidence intervals with nominal coverage properties. 
Neither bias adjustment alone nor bootstrap standard errors alone are sufficient. 
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Scenario 1 5 df (i?^ = 0.78) 10 df {R^ = 0.91) 
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Figure 3: Spatial component of the exposure surface for each of two scenarios in the spatial 
simulation study based on MESA Air. Each scenario corresponds to a different random draw 
of $i(-) from a Matern-based Gaussian process with range 20, differentiability parameter 1, 
and variance on 5* C of 30. The first column is the true surface, and the second and third 
columns show prediction surfaces and from approximating $i(s) with thin-plate splines 
using the indicated degrees of freedom, based on fitting a fixed degree-of-freedom spatial GAM 
model to $i(s) with observations at every location on the 257 x 257 grid S. 
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Figure 4: Subject and monitor locations in the Baltimore MESA Air dataset. 
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Table 1: Results for spatial simulation (1,000 Monte Carlo simulations for each scenario and 
100 bootstrap samples). The average out-of-sample is given in parentheses for each ex- 
posure model. The first column is the relative bias in estimating /3 = 0.1. This is the same 
for (T^ = 200 and = 10 and is estimated from 100,000 Monte Carlo samples, resulting in 
negligible Monte Carlo error. The final six columns show the standard deviation, average 
estimated standard error, and 95% confidence interval coverage, separately for cr^ = 200 and 
^2 = 10. 

0-2 = 200 cj2 = 10 





Rel Bias 


SD 


E(SE) 


Gov 


SD 


E(SE) 


Gov 


Scenario 1 


5 degrees of freedom (0.75) 
















no correction 


-0.027 


0.084 


0.083 


94% 


0.02 


0.019 


93% 


bootstrap standard error only 


-0.027 


0.084 


0.084 


95% 


0.02 


0.021 


95% 


bias correction only 


-0.009 


0.086 


0.083 


94% 


0.021 


0.019 


93% 


bias correction + bootstrap 


-0.009 


0.086 


0.086 


95% 


0.021 


0.021 


96% 


10 degrees of freedom (0.79) 
















no correction 


-0.039 


0.08 


0.08 


95% 


0.019 


0.018 


93% 


bootstrap standard error only 


-0.039 


0.08 


0.082 


96% 


0.019 


0.027 


98% 


bias correction only 


-0.025 


0.081 


0.08 


94% 


0.019 


0.018 


93% 


bias correction + bootstrap 


-0.025 


0.081 


0.088 


97% 


0.019 


0.03 


99% 


Scenario 2 


5 degrees of freedom (0.42) 
















no correction 


-0.125 


0.099 


0.096 


94% 


0.025 


0.022 


87% 


bootstrap standard error only 


-0.125 


0.099 


0.097 


95% 


0.025 


0.026 


90% 


bias correction only 


-0.049 


0.108 


0.096 


93% 


0.028 


0.022 


86% 


bias correction + bootstrap 


-0.049 


0.108 


0.107 


95% 


0.028 


0.03 


94% 


10 degrees of freedom (0.59) 
















no correction 


-0.102 


0.087 


0.085 


93% 


0.021 


0.019 


88% 


bootstrap standard error only 


-0.102 


0.087 


0.085 


94% 


0.021 


0.03 


94% 


bias correction only 


-0.061 


0.091 


0.085 


92% 


0.023 


0.019 


88% 


bias correction + bootstrap 


-0.061 


0.091 


0.094 


95% 


0.023 


0.036 


97% 
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Table 2: Results of analysis analyzing the association between elevated left ventricular mass 
index (LVMI) and residential concentrations of NOx in the MESA Air cohort in Baltimore, 
based on exposure models with land-use regression plus a thin-plate spline regression with 
varying degrees of freedom. The two columns display estimates and standard errors for the 
increase in LVMI {g/m?') associated with a 10 ppb increase in NOx concentration. Cross- 
validated li^ at snapshot monitor locations for each exposure model are given in parentheses, 
based on leave-one-out cross-validation modified to leave out all members of a roadway gradient 
cluster together. 







/3 


SE 


Land-use regression only (0.60) 










no correction 


0. 


,66 


0. 


,62 


bootstrap standard error 


0. 


,66 


0. 


,66 


bias correction + bootstrap 


0. 


,68 


0. 


,68 


5 degrees of freedom (0.68) 










no correction 


0, 


,68 


0, 


,55 


bootstrap standard error 


0. 


,68 


0. 


,62 


bias correction -|- bootstrap 


0. 


,69 


0. 


,67 


10 degrees of freedom (0.41) 










no correction 


0. 


,79 


0, 


,69 


bootstrap standard error 


0. 


,79 


0, 


,66 


bias correction -|- bootstrap 


0. 


,85 


0. 


,78 
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